Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

coreos-boot-mount-generator: Always use mpath for /boot if for root #1022

Merged
merged 2 commits into from
May 19, 2021

Conversation

cgwalters
Copy link
Member

udev/90-coreos-device-mapper: Create label links in real root too

We're currently gating on ENV{DM_SUSPENDED}=="Active" but
10-dm.rules does:

ENV{DM_SUSPENDED}=="Active", ENV{DM_SUSPENDED}="0"
ENV{DM_SUSPENDED}=="Suspended", ENV{DM_SUSPENDED}="1"

So what I think is happening here is that our rule happens to
run before that kicks in, so we make the links once, but not
thereafter.

Change the condition to match what 13-dm-disk.rules from
LVM is doing.

Also slightly reorder the code and add some comments for extra clarity.


coreos-boot-mount-generator: Always use mpath for /boot if for root

If root is on multipath, then we know we must use it for /boot.
The current code is I believe racy because at the time the generator
runs, we're querying the current properties of the device at
/dev/disk/by-label/boot. But multipathd could still be in
the process of setting up and replacing what that symlink
points to.

https://bugzilla.redhat.com/show_bug.cgi?id=1944660


We're currently gating on `ENV{DM_SUSPENDED}=="Active"` but
`10-dm.rules` does:

```
ENV{DM_SUSPENDED}=="Active", ENV{DM_SUSPENDED}="0"
ENV{DM_SUSPENDED}=="Suspended", ENV{DM_SUSPENDED}="1"
```

So what I think is happening here is that our rule happens to
run before that kicks in, so we make the links once, but not
thereafter.

Change the condition to match what `13-dm-disk.rules` from
LVM is doing.

Also slightly reorder the code and add some comments for extra clarity.
@cgwalters
Copy link
Member Author

This doesn't textually conflict with #1011

Although the PR commit says:

Note though that we can't assume that root is also directly on top of the multipath device because of LUKS-on-root.)

Ooh. Right, so the code here then needs to be refined to check whether multipath is in the stack. But the rdcore code already knows that right? Feels like we should be able to communicate the expected path to /boot from the canonical code parsing root= in rdcore or so?

ENV{DM_ACTIVATION}!="1", GOTO="dm_label_end"
ENV{ID_FS_LABEL_ENC}!="?*", GOTO="dm_label_end"
ENV{ID_FS_UUID_ENC}!="?*", GOTO="dm_label_end"
ENV{DM_SUSPENDED}=="1", GOTO="dm_label_end"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I think this will help with some of the udev issues I'm working around in #1011.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related to this area of the code, I've also removed the DM_ACTIVATION line in #1011, which makes the symlinks even more solid!

Comment on lines 68 to 76
# If the root device is multipath, hook up /boot to use that too,
# based on our custom udev rules in 90-coreos-device-mapper.rules
# that creates "label found on mpath" links.
# Otherwise, use the usual by-label symlink.
rootdev="$(findmnt -nvr -o SOURCE /)"
case "${rootdev}" in
/dev/mapper/mpath*) bootdev=/dev/disk/by-label/dm-mpath-boot;;
*) bootdev=/dev/disk/by-label/boot;;
esac
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I think here we should key off of getargbool 0 rd.multipath instead as we do in #1011 (note not rd.multipath=default necessarily to be forward compatible with supporting non-default configs). That also avoids having to deal with the LUKS-on-multipath case entirely.

Copy link
Member Author

@cgwalters cgwalters May 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conceptually though there's a distinction between "this technology is in use" and "this technology is used for these devices".

It looks like dracut interprets rd.multipath as "must start multipathd in the initramfs". OK, fine. But that doesn't (outside of what you're proposing) normally imply that multipath is necessarily used e.g. for the root filesystem.

We are using the root=/dev/disk/by-label/dm-mpath-root as a way to express that.

This problem circles back a bit to whether or not we want to support having /boot on a physically distinct storage from the root filesystem. I am skeptical of the value of that. But OTOH it will work just fine if someone does it manually today so we probably need to not surprise people doing it by having their systems fail to boot after an upgrade.

So...I think e.g. Anaconda ends up being completely explicit about what needs to be done, you have things like rd.lvm.lv to tell dracut to wait for a specific LV in the initramfs, and root=/dev/mapper/fedora-root to wait for that block device. The /boot bit is in /etc/fstab so not normally mounted in the initramfs, but on traditional systems w/client side initramfs generation you have the target /etc/fstab in the initramfs, so the generator can pull that in too.

I think what we're iterating towards here is then we need to support being explicit about where to find /boot. Which I think basically should mirror how we handle / with a missing root= karg effectively meaning root=/dev/disk/by-label/root.

Now, we need to support the current case of rd.multipath=default. I think if boot= is not present, then we implicitly make it boot=/dev/disk/by-label/dm-mpath-boot.

But if there is a boot= karg, then we use it as the path to the device.

Now - we're being more explicit, but I think this may in turn require your proposed support for "day 1 multipath" to inject a boot= karg or so?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like dracut interprets rd.multipath as "must start multipathd in the initramfs". OK, fine. But that doesn't (outside of what you're proposing) normally imply that multipath is necessarily used e.g. for the root filesystem.

I'm 100% with you there (see last paragraph of #1011 (comment)). In an ideal scenario, we don't have any handling of rd.multipath at all because all we really care about is that multipathd took ownership of the devices. So here, we'd have the boot mount have something like After=multipathd-settled.target.

A major reason multipath is special compared to other storage configuration options is that the label we look for is already present even before multipath takes ownership. Whereas you don't have that with the rootfs on e.g. LUKS, where the symlink simply doesn't appear before it's actually ready to be used. Or e.g. in full-disk RAID1, the constituent boot partitions don't have a boot label.

But zooming out here, it must be that the boot partition will always have label boot because that's how GRUB itself finds it and boots the machine (apart from the special logic it has for RAID1). So a boot karg would be redundant in that respect and introduce another source of truth. It'd only really be useful for the multipath scenario, where we need to deal with that funky situation of /dev/disk/by-label/boot changing meaning. But again, in that case I'd prefer to figure out a way with multipath devs to get what we really want (wait until ownership is taken and symlinks updated).

Until then, I think it's OK if we have rd.multipath implying that the boot partition (and the root partition at least on the first boot prior to any possible reconfiguration) is on multipath. And while conceptually I agree that it shouldn't, there isn't really any other reason to have multipath enabled in the initrd since the primary purpose of the initrd is to set up the rootfs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, good argument! Indeed we can't make boot too special because GRUB has to read it. (But...on the flip side GRUB has unfortunately grown more stuff over time, though it seems unlikely to do multipath).

If root is on multipath (which is today for CoreOS always `rd.multipath=default`)
then we *know* we must use it for `/boot`.  We're not going to
support "tearing" where `/boot` is on a non-mpath device but
`/` is on mpath.

The current code is I believe racy because at the time the generator
runs (and systemd generators run *early*), we're querying the
"current" properties of the device at
`/dev/disk/by-label/boot`.  But multipathd could still be in
the process of setting up and replacing the target of that
symlink.  This can cause systemd to tear down and reinitialize
the mount, causing races.

https://bugzilla.redhat.com/show_bug.cgi?id=1944660
@cgwalters cgwalters force-pushed the multipath-predictable branch from 560b996 to 55723e8 Compare May 19, 2021 00:01
Copy link
Member

@jlebon jlebon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One note, but LGTM as is too.

# See discussion in https://github.com/coreos/fedora-coreos-config/pull/1022
bootdev=/dev/disk/by-label/boot
# Yes this isn't a real karg parser but we're trapped in this shell script
if grep -q rd.multipath /proc/cmdline; then
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: this will incorrectly trigger on rd.multipath=0 (see https://github.com/dracutdevs/dracut/blob/9355cb8ea5024533210067373657dc337d63ecb9/modules.d/90multipath/multipathd.service#L12). I mean, a user really shouldn't have to do that, though still would be good to be resilient here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is in the real root; I'm uncertain about pulling in dracut-lib.sh into the real root. We could try to ship rdcore in the real root too and move this there?

Or I could special case rd.multipath=default for now...

Or it looks like coreos-liveiso-autologin-generator has have_karg that we could try to factor out into a little shell library shared by our shell generators...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do this as is so I can cherry pick it, and figure out a cleaner karg handling later?

@cgwalters cgwalters merged commit 508afe3 into coreos:testing-devel May 19, 2021
@cgwalters
Copy link
Member Author

Hm we don't have an rhcos-4.8 branch here yet...should we do that now? And should it be cut from where openshift/os is i.e. d2776ed or from current tip?

@jlebon
Copy link
Member

jlebon commented May 19, 2021

Hm we don't have an rhcos-4.8 branch here yet...should we do that now? And should it be cut from where openshift/os is i.e. d2776ed or from current tip?

Hmm yeah I'd say it's about time now. Maybe let's ask on Slack though I'd lean towards cutting from where openshift/os is currently pinning.

@cgwalters
Copy link
Member Author

OK, will give this a bit of time for anyone else to weigh in otherwise I'll just create rhcos-4.8 at d2776ed to start, and if we want to bump it (which we do) that can be another PR.

@cgwalters
Copy link
Member Author

(Thinking about this a bit more, perhaps what would be most ergonomic actually would be keeping openshift/os main using fcos as a git submodule, but when we branch actually fork it and stop using a submodule. Then we wouldn't have rhcos- branches in this repo, and doing updates would be a single PR and not two)

@cgwalters
Copy link
Member Author

To ssh://github.com/coreos/fedora-coreos-config
 * [new branch]      rhcos-4.8 -> rhcos-4.8

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants